We need more Data!

world
news
machine learning
data
Author

George Charalambous

Abstract

This project reflects how little work has been conducted on underdeveloped countries and asks whether sufficient data about them exists, with the answer being a loud “NO”. Utilizing variables from The World Bank Data Repository, the project explores socio-economic indicators such as agricultural land, electricity access, fertility rate, internet usage, sanitation, population growth, primary school enrollment, and total unemployment. Despite challenges in data availability, logistic regression and neural network models are employed to predict a country’s development status. Initial attempts with logistic regression reveal limited success due to the complexity of the relationships between variables. A neural network model is subsequently developed in a Colab notebook, demonstrating promising results. The project concludes with insights into the need for data that is both more comprehensive and more plentiful, and into the potential of advanced modeling techniques to address underdevelopment effectively.

Introduction

As an intern with the Sub-Saharan Africa Poverty Team at the World Bank Group last summer, I realized that the lack of data on underdeveloped countries is something that needs to be addressed. Without sufficient data, extensive research cannot be conducted, and detailed solutions cannot follow from it.

The purpose of this project is to “effectively” construct a classification model for underdeveloped countries, given a set of variables from The World Bank Data Repository. Selecting the initial variables was challenging, since most data sets did not contain enough data for underdeveloped countries, with NA values overwhelming the corresponding rows. Because of that, I decided to focus on the 21st century and ignore data before the year 2000. The project employs two primary analytical approaches: logistic regression and neural networks.

Logistic regression, a classical statistical method, is utilized to model the probability of a country being classified as underdeveloped based on its socio-economic indicators. By fitting a logistic regression model to the data, the project seeks to identify the key variables that significantly influence a country’s development status. This approach provides interpretable results and insights into the relative importance of different indicators.
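
For reference, the standard form of a logistic regression model can be written as below. The notation is an assumption added here for illustration and does not appear in the original analysis: x_1, ..., x_p stand for the socio-economic indicators of one country, and the beta terms are the coefficients estimated from the data.

```latex
P(\text{Underdeveloped} = 1 \mid x_1, \dots, x_p)
  = \frac{1}{1 + \exp\!\left(-\left(\beta_0 + \sum_{j=1}^{p} \beta_j x_j\right)\right)}
```

A positive coefficient raises the predicted probability as its indicator increases, which is what makes the fitted model comparatively easy to interpret.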

In addition to logistic regression, the project also explores the application of neural networks, a more advanced machine learning technique. Neural networks offer the advantage of capturing complex, nonlinear relationships between predictors and the response variable. By designing and training neural network models, the project aims to uncover nuanced patterns and interactions within the data that may not be captured by traditional linear models like logistic regression.

library(tidyverse)
library(here)
library(maps)
library(plotly)
library(broom)
library(colorspace)
world_df <- map_data("world")
electricity_access <- read_csv("data/access_to_electricity/access_to_electricity.csv", 
                        skip = 4) |>
  select(-c(3:44), -c(67:69))
countries <- c("Afghanistan", "Angola", "Bangladesh", "Benin", "Burkina Faso", "Burundi", "Cambodia", "Central African Republic", "Chad", "Comoros", "Congo, Dem. Rep.", "Djibouti", "Eritrea", "Ethiopia", "Gambia, The", "Guinea", "Guinea-Bissau", "Haiti", "Kiribati", "Lao PDR", "Lesotho", "Liberia", "Madagascar", "Malawi", "Mali", "Mauritania", "Mozambique", "Myanmar", "Nepal", "Niger", "Rwanda", "Sao Tome and Principe", "Senegal", "Sierra Leone", "Solomon Islands", "Somalia", "South Sudan", "Sudan", "Tanzania", "Timor-Leste", "Togo", "Tuvalu", "Uganda", "Yemen, Rep.", "Zambia")
electricity_stats <- electricity_access |>
  pivot_longer(c(3:24), 
               names_to = "Year", 
               values_to = "Electricity Access") |>
  filter(!is.na(`Electricity Access`)) |>
  group_by(`Country Name`) |>
  summarise(`mean_elec_access` = mean(`Electricity Access`, na.rm = TRUE),
            `sd_elec_access` = sd(`Electricity Access`, na.rm = TRUE))
underdeveloped_electricity <- electricity_access |>
  pivot_longer(c(3:24), 
               names_to = "Year", 
               values_to = "Electricity Access") |>
  filter(!is.na(`Electricity Access`)) 
underdeveloped_map <- left_join(underdeveloped_electricity, world_df, by = c("Country Name"="region"))
agricultural_land <- read_csv("data/agricultural_land/agricultural_land.csv", 
                       skip = 4) |>
  select(-c(3:44), -c(67:69))
agriculture_land_stats <- agricultural_land |>
  pivot_longer(c(3:24), 
               names_to = "Year", 
               values_to = "Agricultural Land") |>
  filter(!is.na(`Agricultural Land`)) |>
  group_by(`Country Name`) |>
  summarise(`mean_agr_land` = mean(`Agricultural Land`, na.rm = TRUE),
            `sd_agr_land` = sd(`Agricultural Land`, na.rm = TRUE))
population_growth <- read_csv("data/population_growth_annual/population_growth.csv", 
                            skip = 4) |>
  select(-c(3:44), -c(67:69))
population_growth_stats <- population_growth |>
  pivot_longer(c(3:24), 
               names_to = "Year", 
               values_to = "Population Growth Rate") |>
  filter(!is.na(`Population Growth Rate`)) |>
  group_by(`Country Name`) |>
  summarise(`mean_pop_growth` = mean(`Population Growth Rate`, na.rm = TRUE),
            `sd_pop_growth` = sd(`Population Growth Rate`, na.rm = TRUE))
primary_school_enrol <- read_csv("data/primary_school_enrollment/primary_school.csv", 
          skip = 4) |>
  select(-c(3:44), -c(67:69))
primary_school_enrol_stats <- primary_school_enrol |>
  pivot_longer(c(3:24), 
               names_to = "Year", 
               values_to = "Primary School Enrollment Rate") |>
  filter(!is.na(`Primary School Enrollment Rate`)) |>
  group_by(`Country Name`) |>
  summarise(`mean_prim_school` = mean(`Primary School Enrollment Rate`, na.rm = TRUE),
            `sd_prim_school` = sd(`Primary School Enrollment Rate`,na.rm = TRUE))
total_unemployment <- read_csv("data/total_unemployment/total_unemployment.csv", 
          skip = 4) |>
  select(-c(3:44), -c(67:69))
total_unemployment_stats <- total_unemployment |>
  pivot_longer(c(3:24), 
               names_to = "Year", 
               values_to = "Total Unemployment") |>
  filter(!is.na(`Total Unemployment`)) |>
  group_by(`Country Name`) |>
  summarise(`mean_total_unempl` = mean(`Total Unemployment`, na.rm = TRUE),
            `sd_total_unempl` = sd(`Total Unemployment`, na.rm = TRUE))
sanitation <- read_csv("data/basic_sanitation_services/basic_sanitation.csv", 
          skip = 4) |>
  select(-c(3:44), -c(67:69))
sanitation_stats <- sanitation |>
  pivot_longer(c(3:24), 
               names_to = "Year", 
               values_to = "Sanitation") |>
  filter(!is.na(`Sanitation`)) |>
  group_by(`Country Name`) |>
  summarise(`mean_sanit` = mean(`Sanitation`, na.rm = TRUE),
            `sd_sanit` = sd(`Sanitation`, na.rm = TRUE))
fertility_rate <- read_csv("data/fertility_rate/fertility_rate.csv", 
          skip = 4) |>
  select(-c(3:44), -c(67:69))
fertility_rate_stats <- fertility_rate |>
  pivot_longer(c(3:24), 
               names_to = "Year", 
               values_to = "Fertility Rate") |>
  filter(!is.na(`Fertility Rate`)) |>
  group_by(`Country Name`) |>
  summarise(`mean_fert_rate` = mean(`Fertility Rate`, na.rm = TRUE),
            `sd_fert_rate` = sd(`Fertility Rate`, na.rm = TRUE))
internet <- read_csv("data/internet/internet.csv", 
          skip = 4) |>
  select(-c(3:44), -c(67:69))
internet_stats <- internet |>
  pivot_longer(c(3:24), 
               names_to = "Year", 
               values_to = "Internet") |>
  filter(!is.na(`Internet`)) |>
  group_by(`Country Name`) |>
  summarise(`mean_inter` = mean(`Internet`, na.rm = TRUE),
            `sd_inter` = sd(`Internet`, na.rm = TRUE))
birth_life_exp <- read_csv("data/life_expectancy_birth/life_expectancy_birth.csv", 
          skip = 4) |>
  select(-c(3:44), -c(67:69))
birth_life_exp_stats <- birth_life_exp |>
  pivot_longer(c(3:24), 
               names_to = "Year", 
               values_to = "Life Expectancy at Birth") |>
  filter(!is.na(`Life Expectancy at Birth`)) |>
  group_by(`Country Name`) |>
  summarise(`mean_life_exp` = mean(`Life Expectancy at Birth`, na.rm = TRUE),
            `sd_life_exp` = sd(`Life Expectancy at Birth`, na.rm = TRUE))
full_stats_df <- agriculture_land_stats |>
  left_join(birth_life_exp_stats, by = "Country Name") |>
  left_join(electricity_stats, by = "Country Name") |>
  left_join(fertility_rate_stats, by = "Country Name") |>
  left_join(internet_stats, by = "Country Name") |>
  left_join(sanitation_stats, by = "Country Name") |>
  left_join(population_growth_stats, by = "Country Name") |>
  left_join(primary_school_enrol_stats, by = "Country Name") |>
  left_join(total_unemployment_stats, by = "Country Name") 
full_stats_df <- 
  full_stats_df |>
  mutate(Underdeveloped = ifelse(`Country Name` %in% countries, 1, 0))
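
The same reshape-and-summarise pattern is repeated above for every indicator. As a minimal sketch, it could be wrapped in one helper. The function name `summarise_indicator`, its `prefix` argument, and the toy table are hypothetical and only illustrate the idea; the sketch also assumes each table keeps the `Country Name` and `Country Code` columns followed by one column per year.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: reshape one wide indicator table to long form and
# return the per-country mean and standard deviation of the indicator.
summarise_indicator <- function(wide_df, prefix) {
  wide_df |>
    pivot_longer(-c(`Country Name`, `Country Code`),
                 names_to = "Year",
                 values_to = "value") |>
    filter(!is.na(value)) |>
    group_by(`Country Name`) |>
    summarise("mean_{prefix}" := mean(value),
              "sd_{prefix}" := sd(value),
              .groups = "drop")
}

# Toy table standing in for one World Bank extract (NOT real data).
toy <- tibble(
  `Country Name` = c("A", "B"),
  `Country Code` = c("AAA", "BBB"),
  `2000` = c(10, NA),
  `2001` = c(20, 40),
  `2002` = c(30, 60)
)

toy_stats <- summarise_indicator(toy, "elec_access")
```

With a helper like this, each indicator would need a single call, and the output column names would stay consistent with the `mean_*` and `sd_*` names used in the joins.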

Variable Description

| Variable | Description |
| --- | --- |
| agriculture_land | The share of land area that is arable, under permanent crops, and under permanent pastures |
| birth_life_exp | The number of years a newborn infant would live if prevailing patterns of mortality at the time of its birth were to stay the same throughout its life |
| electricity_access | The percentage of population with access to electricity |
| fertility_rate | The number of children that would be born to a woman if she were to live to the end of her childbearing years and bear children in accordance with age-specific fertility rates of the specified year |
| internet | The percentage of population that uses the internet |
| population_growth | Annual population growth rate for year t is the exponential rate of growth of midyear population from year t-1 to t, expressed as a percentage |
| primary_school_enrol | Gross enrollment ratio is the ratio of total enrollment, regardless of age, to the population of the age group that officially corresponds to the level of education shown |
| sanitation | Percentage of people using at least basic sanitation services, that is, improved sanitation facilities that are not shared with other households |
| total_unemployment | The share of the labor force that is without work but available for and seeking employment |
#| output: false
#| echo: false
# Pairwise plot matrix of the per-country mean columns
library(GGally)
ggpairs(data = full_stats_df, columns = c(2, 4, 6, 8, 10, 12, 14, 16, 18))

Logistic Regression Model Attempt

The first attempt at creating a classifier uses logistic regression models. This modelling method was chosen first because it handles such diverse data sets well, offering high interpretability, computational efficiency, and flexibility.

The first step in selecting the variables for these models is an exploratory analysis of the data set using a pairwise plot matrix, together with the full summary of the first fitted model to identify significant variables. This process highlighted potential correlations and patterns, giving an idea of which parameters might be useful to include in the models. The models use a data set made up of the per-country means of all the variables, after they have been normalized and scaled to a common baseline.

Once fitted, the initial model is used to make predictions about the development status of a country on a constructed grid. For each variable of interest, a range of values is generated, and these ranges are combined with the median values of the remaining predictors to form the grid. Separating ranges from medians in this way is done for visualization and interpretation purposes. With the logistic regression model trained on the selected variables, predictions are made on this grid to estimate the probability of a country being classified as underdeveloped.

# Get the means column for each variable
full_means_df <- full_stats_df |>
  select(contains("mean"))

numeric_columns <- sapply(full_means_df, is.numeric)

scaled_data <- scale(full_means_df[, numeric_columns])

scaled_data_with_undev <- scaled_data |>
  as.data.frame() |>
  mutate(Underdeveloped = full_stats_df$Underdeveloped) |>
  na.omit()
median_agr_land <- median(scaled_data_with_undev$mean_agr_land, na.rm = TRUE)
median_life_exp <- median(scaled_data_with_undev$mean_life_exp, na.rm = TRUE)
median_elec_access <- median(scaled_data_with_undev$mean_elec_access, na.rm = TRUE)
median_fert_rate <- median(scaled_data_with_undev$mean_fert_rate, na.rm = TRUE)
median_sanit <- median(scaled_data_with_undev$mean_sanit, na.rm = TRUE)
median_population_growth <- median(scaled_data_with_undev$mean_pop_growth, na.rm = TRUE)
median_primary_school <- median(scaled_data_with_undev$mean_prim_school, na.rm = TRUE)
median_total_unempl <- median(scaled_data_with_undev$mean_total_unempl, na.rm = TRUE)
median_internet <- median(scaled_data_with_undev$mean_inter, na.rm = TRUE)
library(modelr)
## First Attempt
# Fit the logistic regression model with all the variables
model_glm <- glm(Underdeveloped ~ .,
                  data = scaled_data_with_undev, family = "binomial")

# Check the summary of the model
summary(model_glm)
# Identify numeric columns
numeric_cols <- sapply(scaled_data_with_undev, is.numeric)

# Compute range for each numeric column, removing NA values
ranges <- sapply(scaled_data_with_undev[, numeric_cols], function(x) {
  x <- na.omit(x)
  c(min(x), max(x))
})
underdeveloped <- scaled_data_with_undev |>
  filter(Underdeveloped == 1)

First Attempt

In this attempt, a logistic regression model is fitted using all available variables to predict whether a country is underdeveloped. The model is trained on the entire feature set, but for visualization and interpretation purposes, sequences of the mean life expectancy, mean fertility rate, and mean population growth are generated to explore their relationships, while the remaining variables are held constant at their median values. Although the resulting graphic has a smooth shape, it does not convincingly show how well the model fits. Notably, the lines representing different levels of mean electricity access diverge across the range of mean fertility rates, suggesting distinctions among countries with very low or very high mean electricity access.
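
The grid-and-predict step described above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's code: the toy data frame, its coefficients, and the choice of quantiles are assumptions, and base R's `expand.grid` is used here as a stand-in for whatever grid helper the original analysis relied on.

```r
set.seed(1)

# Synthetic stand-in for the scaled data (NOT the project's data).
n <- 200
toy <- data.frame(
  mean_fert_rate   = rnorm(n),
  mean_elec_access = rnorm(n),
  mean_life_exp    = rnorm(n)
)
toy$Underdeveloped <- rbinom(
  n, 1, plogis(-1 + 1.5 * toy$mean_fert_rate - 1.0 * toy$mean_elec_access)
)

# Fit on every predictor, as in the first attempt.
fit <- glm(Underdeveloped ~ ., data = toy, family = "binomial")

# Grid: vary one predictor over its observed range, take a few levels of a
# second predictor, and hold the remaining predictor at its median.
grid <- expand.grid(
  mean_fert_rate   = seq(min(toy$mean_fert_rate), max(toy$mean_fert_rate),
                         length.out = 50),
  mean_elec_access = quantile(toy$mean_elec_access, c(0.1, 0.5, 0.9)),
  mean_life_exp    = median(toy$mean_life_exp)
)

# type = "response" returns probabilities rather than log-odds.
grid$pred_prob <- predict(fit, newdata = grid, type = "response")
```

Plotting `pred_prob` against `mean_fert_rate`, with one line per level of `mean_elec_access`, would give the kind of smooth curves discussed in this section.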

Second and Third Attempts

In these subsequent attempts, the models are again trained using the entire feature set, with sequences of the mean life expectancy, mean fertility rate, and mean population growth generated specifically for visualization and interpretation. The other variables are held constant at their median values. Despite the smooth appearance of the resulting graphics, the models struggle to fully capture the data’s complexity. Notably, the relationship between mean fertility rate and predicted probability varies across different levels of mean population growth and mean internet access, indicating potential distinctions among countries with varying development statuses.

The visualizations from the plots above suggest that fitting a logistic regression model with different combinations of predictors does not yield satisfactory results. This points to the limitations of using a simple model to capture the complexities of the data. Therefore, it might be beneficial to explore more complex modeling approaches that can better capture the relationships between the variables included in the analysis.

Neural Network Application

Due to the disappointing results from the logistic regression models, the next attempt uses a much more complex classification model from machine learning: a neural network. It is a method that teaches computers to process data in a way that is inspired by the human brain. Given the complexity of our data set, this model, by incorporating all of the variables, might be able to capture the significant relationships among them and produce reasonably accurate classifications of countries by development status.

The data is first split into training and testing sets. Then, a sequential neural network model is constructed with several dense layers, each employing a ReLU activation function. The final layer uses a sigmoid activation function to produce binary classification predictions for an observation (Underdeveloped or Developed). The model is compiled with the Adam optimizer, a binary cross-entropy loss function, and an accuracy metric. Subsequently, the network is trained on the training data for 100 epochs. The training progress is monitored using the history object to track metrics such as accuracy and loss over epochs. After some experimenting with the hyperparameters, this model reached 95.65% accuracy, indicating very good performance, which can be seen in the maps below.
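
The notebook itself is not reproduced in this post. As a rough illustration of the split, train, and evaluate workflow, the sketch below uses R's `nnet` package on synthetic data. This is a deliberately simpler stand-in and not the network described above: `nnet` fits a single hidden layer with logistic units rather than several ReLU layers, and it does not use the Adam optimizer. The toy data, the 20% hold-out share, and the hidden-layer size are all assumptions.

```r
library(nnet)  # single-hidden-layer network; a stand-in for the model above

set.seed(42)

# Synthetic stand-in data with a nonlinear class boundary (NOT the project's data).
n <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
toy <- data.frame(x1 = x1, x2 = x2,
                  Underdeveloped = as.integer(x1^2 + x2^2 < 1))

# Hold out 20% of the rows for testing.
test_idx <- sample(n, size = round(0.2 * n))
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# entropy = TRUE fits a logistic output unit with cross-entropy loss.
nn_fit <- nnet(Underdeveloped ~ x1 + x2, data = train,
               size = 8, entropy = TRUE, maxit = 500, trace = FALSE)

# Predicted probabilities on the held-out rows, thresholded at 0.5.
pred_prob  <- as.vector(predict(nn_fit, newdata = test))
pred_class <- as.integer(pred_prob > 0.5)
test_accuracy <- mean(pred_class == test$Underdeveloped)
```

Evaluating on rows the model never saw during training is what makes an accuracy figure meaningful; accuracy computed on the training rows alone would tend to be optimistic.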

The first map represents the actual classification of underdeveloped countries, filled with a blue color. The second map represents the predicted outcomes of our neural network, with the predicted underdeveloped countries highlighted with a dark red color.

Conclusion

While the maps above indicate that the network does an exceptional job on the classification task, the table below raises some concerns. According to the table, the model classified 51 countries as underdeveloped, when there are only 42. On the other hand, the model classified 176 countries as developed, close to the real number of 185. This suggests that while the model performs relatively well in identifying developed countries, it struggles more with accurately classifying underdeveloped ones. Proportionally speaking, the predicted count of developed countries is 95.14% of the actual count (176 of 185), while the actual count of underdeveloped countries is only 82.35% of the predicted count (42 of 51). These ratios compare class totals only; they do not show which individual countries were misclassified. This result underscores the answer to the question this project set out to address: THERE IS NOT ENOUGH DATA FOR UNDERDEVELOPED COUNTRIES. Therefore, with insufficient data to work with, the provision of accurate and precise solutions for countries in the underdeveloped category seems far from reality.
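
The two percentages can be reproduced from the class totals in the summary table, as the sketch below shows. The second half is only an illustration of how a per-country cross-tabulation would give true per-class rates; its label vectors are hypothetical and are not the model's predictions.

```r
# Class totals taken from the summary table in this post.
actual    <- c(Underdeveloped = 42, Developed = 185)
predicted <- c(Underdeveloped = 51, Developed = 176)

# Ratio of the smaller total to the larger one, per class.
ratio <- pmin(actual, predicted) / pmax(actual, predicted)
ratio_pct <- round(100 * ratio, 2)  # 82.35 and 95.14

# Hypothetical per-country labels, only to show the cross-tabulation.
actual_lab <- factor(c(1, 1, 0, 0, 0, 1, 0, 0), levels = c(1, 0),
                     labels = c("Underdeveloped", "Developed"))
predicted_lab <- factor(c(1, 0, 0, 1, 0, 1, 0, 0), levels = c(1, 0),
                        labels = c("Underdeveloped", "Developed"))
conf_mat <- table(Actual = actual_lab, Predicted = predicted_lab)

# Recall per class: correct predictions divided by the actual class size.
recall <- as.vector(diag(conf_mat) / rowSums(conf_mat))
```

A cross-tabulation like `conf_mat` separates false positives from false negatives, which class totals alone cannot do.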

This project encourages others, myself included, to use it as a starting point and attempt to create a larger data set covering a wider range of variables, in order to produce a more accurate classifier. In addition, one can examine the sectors of a country where more data is urgently needed and suggest strategies that would consolidate useful insights, leading to more targeted and sufficient research.

Summary of Actual and Predicted Classes
| | Underdeveloped | Developed |
| --- | --- | --- |
| Actual | 42 | 185 |
| Predicted | 51 | 176 |